```{r setup, include=FALSE} #Remind R where to look for libraries .libPaths(c("C:/Users/ale097/Data School/Packages")) #Load libraries library(tidyverse) library(kableExtra) library(knitr) library(edgeR) library(limma) library(cowplot) library(rlang) library(wesanderson) knitr::opts_knit$set(root.dir = rprojroot::find_rstudio_root_file()) knitr::opts_chunk$set( results = 'asis', warning = FALSE, message = FALSE, fig.align = 'center' ) read_csv("data/Redlands_counts.csv") microRNA_counts <- read_csv("data/Redlands_counts.csv") read_csv("data/redlands_horse_metadata.csv") redlands_horse_metadata <- read_csv("data/redlands_horse_metadata.csv") ``` # Introduction I'm Marina and I work on viruses. Before data school I would typically enter my data into Excel or Prism to make bar graphs with errors bars, then communicate these in publications and powerpoint slides. Using R, I can now reproducibly wrangle data imported from Excel, plot with ggplot2 to make really flexible and informative visualizations of the raw data. This has helped me to understand the sources of variation in my data, apply appropriate statistics and make novel visualizations. # My Project My group are looking to find microRNAs that are predictive of viral infection. Such microRNAs could act as early bio-markers of infection, allowing us to isolate the infected individual before they show signs of infection. To find such microRNAs, we set up an experiment in which three horses were infected with Hendra virus and blood was taken at day 0, 1, 3, 5 and 7. RNA was extracted and sequenced, many millions of sequencing reads were then aligned to the 800+ currently annotated microRNAs in the horse genome. This process generates a table where the number of aligned reads for each microRNA are counted for each sample. My goal was to model a relationship between microRNA expression level and time since infection using multi-linear regression.

I began with two tables, one a count matrix for each microRNA and the other a table containing metadata, a snippet of each is shown here.

Just by looking at these tables, there are some challenging aspects to this data.
* 800 microRNAs
* microRNAs with zero counts
* Sample 6 did not work
* High dynamic range of counts
```{r counts_table, out.width='100%', echo = FALSE} knitr::kable(head(microRNA_counts[1:7], n =6), format = "html") %>% kable_styling("striped", full_width = FALSE) ```
```{r metadata_table, out.width = '100%', echo = FALSE} knitr::kable(head(redlands_horse_metadata, n = 6 ), format = "html") %>% kable_styling("striped", full_width = FALSE) ```
```{r tidying data for plotting, out.width = '100%', results = 'hide', echo=FALSE} redlands_horse_metadata_long <- redlands_horse_metadata %>% mutate(day = sub("d","", condition )) %>% mutate(day = as.numeric(day)) %>% select(-condition) lib_size <- microRNA_counts %>% gather(sample, counts, -gene) %>% left_join(redlands_horse_metadata_long, by = "sample") %>% rename(library = sample) %>% mutate(library = sub("s","", library)) %>% mutate(counts_million = counts/1000000) %>% group_by(library) %>% summarise(counts_sum = sum(counts_million)) median <- lib_size %>% filter(library != 6) %>% summarise(median = median(counts_sum)) ``` # Plotting the total counts Plotting the total counts for each sample or library we see that they vary considerably around the median for this experiment (`r median(lib_size$counts_sum)` million counts as shown by the red dashed line). To conduct a filtering step I needed to normalize counts by library size so that sample 11 for example, doesn't loose too many microRNAs because the library wasn't as high yielding as the others. ```{r plotting library size, out.width = '50%', fig.align = 'center'} counts_to_plot <- microRNA_counts %>% gather(sample, counts, -gene) %>% left_join(redlands_horse_metadata_long, by = "sample") %>% rename(library = sample) %>% mutate(library = sub("s","", library)) %>% mutate(counts_million = counts/1000000) %>% full_join(lib_size, by = "library") ggplot(counts_to_plot, aes(y = counts_sum, x = library, color = day, shape = animal)) + geom_point(size = 5) + geom_hline(yintercept=1.808277, linetype="dashed", color = "red") + scale_x_discrete(limits = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)) + labs( x = "Library", y = "Library size (million counts)", title = "Library sizes") + theme(axis.text = element_text(size = 12), axis.title = element_text(size = 14, face = "bold"), legend.text = element_text(size = 12), legend.title = element_text(size = 14, face = "bold"), title = element_text(size = 14, face = "bold")) ``` # Filtering out the lowly expressed genes I converted the raw counts to counts per million (CPM) using the `edgeR` package after removing sample 6. Then filtered the lowly expressed microRNAs. The `filterByExpr` function used here removes genes based on sequencing depth and experimental design. For this data set I kept microRNAs with less than 5 cpm in at least 3 samples because we have 3 animals or biological replicates and also removed genes with less than 80 cpm across all samples. Shown here is a density plot of log transformed counts before and after filtering, where the red dashed line represents the 5 cpm cut off chosen for this dataset. log-cpm is used to normalize the increasing variance with increasing count size. ```{r filtering out lowly expressed genes, fig.align = 'center',fig.width = 8, fig.height = 5, out.width='60%'} #creating a DGEList with counts per million after removing s6 microRNA_counts_s6 <- microRNA_counts %>% select(-s6) redlands_horse_metadata_s6 <- redlands_horse_metadata %>% filter(sample != "s6") horse_counts <- DGEList(counts = microRNA_counts_s6[, -1], genes = microRNA_counts_s6[, 1], samples = redlands_horse_metadata_s6) # tidying for plotting cpm_long <- cpm(horse_counts) %>% as.tibble() %>% bind_cols(as.tibble(horse_counts$genes)) %>% gather(sample, cpm, -gene) redlands_horse_metadata_s6_long <- redlands_horse_metadata_long %>% filter(sample != "s6") counts_cpm <- microRNA_counts_s6 %>% gather(sample, counts, -gene) %>% left_join(redlands_horse_metadata_s6_long, by = "sample") %>% bind_cols(cpm_long) %>% select(-sample1, -gene1) %>% rename(library = sample) %>% mutate(library = sub("s","", library)) %>% mutate(counts_million = counts/1000000) %>% full_join(lib_size, by = "library") %>% mutate(log_cpm = log(cpm)) min_count <- 10/median # filtering lowly expressed genes design_matrix <- model.matrix(~ condition + animal, data = redlands_horse_metadata_s6) gene_filter <- filterByExpr(horse_counts, min.count = 6, min.total.count = 80, design = design_matrix) horse_filtered <- horse_counts[gene_filter, , keep.lib.sizes = FALSE] # tidying for plotting filtered_cpm_long <- cpm(horse_filtered) %>% as.tibble() %>% bind_cols(as.tibble(horse_filtered$genes)) %>% gather(sample, cpm, -gene) filtered_cpm <- filtered_cpm_long %>% left_join(cpm_long) %>% left_join(redlands_horse_metadata_s6_long) %>% rename(library = sample) %>% mutate(library = sub("s","", library)) %>% mutate(log_filtered_cpm = log(cpm)) # Plotting plot_unfiltered <- ggplot(counts_cpm, aes(x = log_cpm, color = library)) + geom_density() + geom_vline(xintercept = log(5.53), linetype="dashed", color = "red") + scale_color_discrete(limits = c(1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15)) + labs(title = "Unfiltered microRNAs", x = "log-cpm", y = "density") + theme(axis.text = element_text(size = 12), axis.title = element_text(size = 14, face = "bold"), legend.text = element_text(size = 12), legend.title = element_text(size = 14, face = "bold"), title = element_text(size = 14, face = "bold"), plot.margin = margin(6, 2, 6, 2)) plot_filtered <- ggplot(filtered_cpm, aes(x = log_filtered_cpm, color = library)) + geom_density() + geom_vline(xintercept = log(5.53), linetype="dashed", color = "red") + scale_color_discrete(limits = c(1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15)) + labs(title = "Filtered microRNAs", x = "log-cpm", y = "") + theme(axis.text = element_text(size = 12), axis.title = element_text(size = 14, face = "bold"), legend.text = element_text(size = 12), legend.title = element_text(size = 14, face = "bold"), title = element_text(size = 14, face = "bold"), plot.margin = margin(6, 0, 6, 0)) plots <- plot_grid( plot_unfiltered + theme(legend.position="none"), plot_filtered + theme(legend.position="none"), hjust = -1, nrow = 1 ) legend <- get_legend( plot_unfiltered + theme(legend.box.margin = margin(0, 0, 0, 6)) ) plot_grid(plots, legend, rel_widths = c(3, .4)) ``` # Trimmed Means of M values normalization I was left with `r nrow(horse_filtered$genes)` from the original `r nrow(horse_counts$genes)` and then needed to perform normalization. It is assumed that all samples should have a similar range and distribution of expression values. However, during the sample preparation or the sequencing process, external factors that are not of biological interest can affect the expression of individual samples. Normalization is required to ensure that the expression distributions of each sample are similar across the entire experiment. `edgeR` employs the Trimmed Means of M values (TMM) in which highly expressed genes and those that have a large variation of expression are excluded, whereupon a weighted average of the subset of genes is used to calculate a normalization factor shown in the table below. ```{r TMM normalisation, result = 'hide'} horse_norm <- calcNormFactors(horse_filtered, method = "TMM") # Table of norm factors norm_factors <- horse_norm$samples %>% select(norm.factors, sample) %>% rename(library = sample) %>% mutate(library = sub("s","", library)) %>% mutate(library = as.numeric(library)) %>% mutate(norm.factors = round(norm.factors, 2)) %>% spread(library, norm.factors) norm_cpm_long <- cpm(horse_norm) %>% as.tibble() %>% bind_cols(as.tibble(horse_norm$genes)) %>% gather(sample, norm_cpm, -gene) norm_cpm <- norm_cpm_long %>% left_join(cpm_long) %>% left_join(redlands_horse_metadata_s6_long) %>% rename(library = sample) %>% mutate(library = sub("s","", library)) %>% mutate(library = as.numeric(library)) %>% mutate(log_cpm = log(cpm)) %>% mutate(log_norm_cpm = log(norm_cpm)) plot_unnorm <- ggplot(norm_cpm, aes(y = log_cpm, x = library, group = library)) + geom_boxplot() + scale_x_discrete(limits = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)) + labs(title = "Unnormalized", y = "log-cpm") + theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12, face = "bold"), title = element_text(size = 10, face = "bold"), legend.position = "none") plot_norm <- ggplot(norm_cpm, aes(y = log_norm_cpm, x = library, group = library)) + geom_boxplot() + scale_x_discrete(limits = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)) + labs(title = "Normalized", y = "log-cpm") + theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12, face = "bold"), title = element_text(size = 10, face = "bold"), legend.position = "none") ``` ```{r norm_factors_table, out.width='60%', fig.align = 'center', echo = FALSE} knitr::kable(norm_factors, format = "html") %>% kable_styling("striped", full_width = FALSE) ``` ```{r Plotting normalisation, fig.align = 'center', fig.height = 3, fig.width = 7, out.width='80%'} plot_grid(plot_unnorm, plot_norm) ``` # Voom transformation to normalize heteroscedastic counts I could then look at expression of individual genes in the libraries. In genome-wide statistical analysis We can use the extensive between gene comparisons by generalized linear regression to provide a reliable inference of expression. However count measurements are heteroscedastic, i.e. the variance depends on the mean abundance. This can be modeled using the negative binomial distribution. In `voom` we model the mean-variance trend of the log-cpm values at the individual observation level. Voom stands for "variance modeling at the observational level". Note the flat line, indicating high biological variation in this dataset. ```{r voom transformation, fig.height = 3, fig.width = 4} horse_voom <- voom(horse_norm, plot = T, design = design_matrix) ``` # Multi-linear regression The the `voom` weights calculated by the distance from the red curve feed into`limma` so that the variance is no longer dependent on the mean expression level. Here I used the `lmFit` function with model `~ day + animal`. This plot highlights microRNAs with a p-value <0.05 and coefficient more than 0.15 units away from zero. Many of the significant coefficients are close to zero meaning that the day post infection has little impact on expression of most microRNAs. There were however some of interest which have been plotted against day pot infection below. Equine miR-143 looks promising. This microRNA is consitently downregulated in Human Papilloma Virus-induced cancer. ```{r limma modeling using numeric day, out.width = '60%'} metadata_numeric <- redlands_horse_metadata %>% filter(sample != "s6") %>% mutate(day = sub("d","", condition)) %>% select(-condition) %>% mutate(day = as.numeric(day)) horse_counts_day <- DGEList(counts = microRNA_counts_s6[, -1], genes = microRNA_counts_s6[, 1], samples = metadata_numeric) design_matrix_day <- model.matrix(~ day + animal, data = metadata_numeric) gene_filter_day <- filterByExpr(horse_counts_day, min.count = 6, min.total.count = 80, design = design_matrix_day) horse_filtered_day <- horse_counts_day[gene_filter_day, , keep.lib.sizes = FALSE] horse_norm_day <- calcNormFactors(horse_filtered_day) horse_voom_day <- voom(horse_norm_day, design = design_matrix_day) vfit <- lmFit(horse_voom_day, design_matrix_day) efit <- eBayes(vfit) # plotting the significant genes coeff <- topTable(efit, number = 231) %>% as.tibble() %>% select(gene, day, adj.P.Val) %>% mutate(signif = -log10(adj.P.Val)) highlight <- coeff %>% filter(adj.P.Val <= 0.05) %>% filter(day >=0.15 | day <=-0.15) ggplot(coeff, aes(x = day, y = signif)) + geom_point(size = 1) + geom_hline(yintercept = -log10(0.05), linetype="dashed", color = "red") + geom_vline(xintercept = 0.15, linetype = "dashed", color = "blue") + geom_vline(xintercept = -0.15, linetype = "dashed", color = "blue") + geom_text(data = highlight, aes(x = day, y = signif, label = gene), hjust = 0.5, vjust = -0.5, size = 5, check_overlap=TRUE) + coord_cartesian(xlim = c(-1, 1), ylim = c(0, 4)) + labs( x = "coefficient of `~ day + animal' model", y = "-log10(adjusted P value)", title = "Identifying microRNAs that respond to infection") + theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12, face = "bold"), title = element_text(size = 14, face = "bold")) ``` ```{r plotting significant microRNAs, out.width = '60%'} microRNA_expn_day <- horse_voom$E %>% as.tibble %>% bind_cols(as.tibble(horse_voom$genes)) %>% gather(sample, expression, -gene) %>% left_join(metadata_numeric, by = "sample") %>% filter(gene == "eca-miR-143"| gene == "eca-miR-215"|gene =="eca-miR-10b"|gene == "6_70095"|gene =="31_60371"|gene == "24_47765"|gene == "20_38845"|gene =="8_76957") ggplot(microRNA_expn_day, aes(x = day, y = expression, color = animal, group = gene)) + geom_point() + geom_smooth(method = "glm") + facet_wrap(~gene, ncol = 4) + labs(title = "Plotting individual microRNAs", x = "day post Hendra infection", y = "microRNA expression")+ theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12, face = "bold"), title = element_text(size = 14, face = "bold")) ``` # Principle Components Analysis (PCA) for QC Using gene expression levels from my voom transformation, I performed a principle component analysis to show similarities and dissimilarities between samples in an unsupervised manner. Ideally, samples would cluster well within the primary condition of interest, and any sample straying far from its group could be identified and followed up for sources of error or extra variation. Ideally, replicates should lie very close to one another. We see that microRNA expression in horse 3 day 0 (s11) is more affected by a source of variation than any other samples suggesting a technical problem. This sample was only sequenced to a depth of 0.5 million reads. It would be interesting to exclude this sample as an outlier and repeat differential expression analysis. A further concern is that horse 1 and 3 cluster in different locations along the dominant principle component (PC1), suggesting that our biological replicates hold a greater source of variation than conditions. ```{r PCA plotting, out.width='60%', fig.align = 'center'} #spread and convert to a matrix for PCA analysis microRNA_expn <- horse_voom$E %>% as.tibble %>% bind_cols(as.tibble(horse_voom$genes)) %>% gather(sample, expression, -gene) %>% left_join(redlands_horse_metadata, by = "sample") scaled_microRNAs <- microRNA_expn %>% spread(gene, expression) %>% select(-animal, -condition) %>% column_to_rownames("sample") %>% scale() pca_microRNAs <- prcomp(scaled_microRNAs) pca_microRNAs$x %>% as_tibble(rownames = "sample") %>% gather(PC, expression, -sample) %>% left_join(redlands_horse_metadata, by = "sample") %>% spread(PC, expression) %>% ggplot(aes(x = PC1, y = PC2)) + geom_text(aes(label = condition, color = animal), size = 6) + theme(axis.text = element_text(size = 14), axis.title = element_text(size = 16, face = "bold"), title = element_text(size = 14, face = "bold"), legend.text = element_text(size = 14)) + labs(title = "Principle Component Analysis") ``` # My Digital Toolbox I've used tidyverse to bring tabular data outputs from analysis into 'tidy' tables in which each variable forms a column, each observation forms a row and each cell holds a value. This can then feed into ggplot2 to make flexible visualizations. I have also used the RNA-seq packages, edgeR, limma and voom then the publishing packages rmarkdown and knitr. All these were totally new to me! ![](resources/img/tidyverse.png){width=100px} ![](resources/img/edgeR.png){width=270px} ![](resources/img/hex-rmarkdown.png){width=90px} ![](https://raw.githubusercontent.com/tidyverse/ggplot2/master/man/figures/logo.png){width=85px} # Favourite tool I've really enjoyed tidyverse, such a straightforward and standardized way of looking at data, particularly love the filter function! Also rmarkdown, I really want to write papers using rmarkdown. # My time went ... In the early weeks, learning to use a programming language was really hard. There were so many functions to remember and things moved very slowly. Once I could gather and spread and make plots of my data things moved much more quickly and it became addictive! # Next steps I would like to try the same analysis pipeline used here with different data sets. Going forward I'm eager to generate my own data, beginning with FAIR data principles so that my work is findable, accessible, interoperable and reproducible! # My Data School Experience I'm so thankful to be given such a great introduction to programming, data management and app development. I also appreciated the in-depth stats lessons. Already I have been collecting data in a tidy way to save time on analysis and plotting the raw data to understand sources of variation which feeds into improved experimental design. On a personal note I have really enjoyed the feeling of doors opening. Excited to feel connected to a community of data scientists at CSIRO and beyond.